Will they like it? Will they really, really like it?

In this project, we develop a multiple linear regression model for predicting how audiences will score a movie on an online rating system – in this case, Rotten Tomatoes – based on the known values of other variables such as the film’s genre, critical reception, or year of release.

We train our model on the movies dataset: a somewhat cleaned-up sample compiled and provided through the “Linear Regression and Modeling” MOOC, one of the courses in the “Statistics with R” sequence offered by Duke University through the Coursera platform, developed and taught by Mine Çetinkaya-Rundel and others.


Getting Set Up

Loading packages and data

The R libraries used for this project pull from the standard tidyverse packages dplyr (for data handling) and ggplot2 with RColorBrewer (for visualization), along with the MOOC-course-specific statsr package that also hosts the movies dataset on which the analysis is performed. The knitr and kableExtra packages allow for a more controlled rendering of tables.

# Load libraries

library(dplyr)
library(statsr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(RColorBrewer)

# Load dataset

load("movies.Rdata")

Part 1: The Movies Dataset

The movies dataset collects observations over 32 variables for a randomly sampled set of 651 movies produced and released between 1970 and 2014. (As the histogram of thtr_rel_year below shows, however, the sample skews toward more recent films.)

# Plot a histogram for the count of movies by year of release date, binwidth set so each year gets its own bar.

pal <- brewer.pal(n = 5, name = "Blues")  # picks an RColorBrewer basic color palette

# Set up plot

p <- ggplot(data = movies, aes(x = thtr_rel_year))
gp <- geom_histogram(binwidth = 1, alpha = 0.4, color = pal[5], fill = pal[4])
th <- theme_minimal()
l <- labs(x = NULL, y = NULL, title = "Movies by Year of Theatrical Release")

# Generate plot

p + gp + th + l

The movies dataset is observational: it consists of data collected after the fact, not through a planned experimental design with random assignment and controls. So while we can address the strength and predictive significance of correlations among variables (such as running time and audience score), we will not be able to speak to issues of causality (i.e., whether audiences like a movie more or less because of its length).

The dataset has some areas of concern, some of which we don’t have enough information to reconcile here. We do know that the overall sample size of 651 movies is large enough to allow for conclusions about correlations at a high level of confidence, even at the many degrees of freedom we’ll be working with in a multiple variable model.

But we’ll want to be careful to consider how many ratings a given movie’s score is based on – we have this measure for IMDB votes, but not for the Rotten Tomatoes ratings, and we’ll be taking a leap of faith here to assume that the number of ratings is comparable. As it stands, the minimum number of IMDB votes for movies included in our dataset sample is a sufficiently large 180, with the median at 15,116 votes.

# Basic summary statistics for the `imdb_num_votes` dimension (the number of votes contributing to IMDB audience ratings).

votes <- movies %>%
  summarise("minimum" = min(imdb_num_votes),
            "median" = median(imdb_num_votes),
            "mean" = mean(imdb_num_votes),
            "maximum" = max(imdb_num_votes))

# Present those stats in a styled, compact, and left-aligned table

kable((votes), 
      caption = "IMDB Votes per Movie", 
      digits = 0, 
      format.args = list(big.mark = ",")) %>%
  kable_styling(full_width = FALSE, position = "left")
IMDB Votes per Movie

minimum   median   mean     maximum
    180   15,116   57,533   893,008

Linear regression models also rely for their validity on the independence of observations, and we might well question whether that condition holds here. Individuals contributing online ratings to Rotten Tomatoes or IMDB are self-selected and, by definition, engaged participants. And we can’t know from our dataset or its codebook whether any particular individual may have contributed more than one rating – or potentially many, many ratings. Such “super-users” might especially impact the analysis of the less-represented genres: one especially active Greedo-shot-first-er might skew the entire nine-film “Science Fiction & Fantasy” subset.


Part 2: Research questions

Let’s start by exploring three basic questions: the trustworthiness of our primary response measure, how that measure correlates with other potential explanatory variables, and how those correlations might hold across subsets of the data.

Question One: Do Rotten Tomatoes Audience Ratings Provide a Useful Measure?

How well do the ratings systems correlate? The movies dataset offers a few possible response variables around which a predictive linear model for a movie’s success could be built. Most obviously, it provides measures drawn from two well-known online sites: IMDB and Rotten Tomatoes. IMDB provides an average of audience ratings on a 10-point scale, and Rotten Tomatoes provides both categorical ratings (“Rotten,” “Fresh,” “Certified Fresh”) and numerical scores that represent the percentage of positive reviews by audiences and critics.

A strong correlation between systems, such as between IMDB and Rotten Tomatoes audience ratings, would give us some confidence in using either as a useful proxy for the other.

Question Two: What Are Our Best Explanatory Variables?

What other factors will predict online audience ratings? What potential explanatory variables best correlate with our target response variable of the 0- to 100-point Rotten Tomatoes audience score for the purpose of developing a simple yet robust multiple linear regression model? Are any of these surprisingly significant, either in terms of the strength of the correlation or its direction?

Question Three: Does Genre Matter?

Will the model we build work equally well for different kinds of movies? Audiences seek out and enjoy different types of movies for different reasons. The predictors for a brain-teasing documentary’s success with audiences might be (or might not be!) pretty different from the predictors for a blood-pumping action & adventure movie. Will our model be equally useful across genres? Or would it make more sense to build different models for each type of film?

Note that this is a question with real-world implications: in order to maximize a film’s online reputation (key for a movie’s post-theatrical life), would our model suggest that the producers/marketers of a comedy like Lady Bird be advised to put their efforts toward factors different from those that would be effective for a drama like Gotti?


Part 3: Exploratory data analysis

Do Rotten Tomatoes Audience Ratings Provide a Useful Measure?

To begin, we’ll look at the correlation between the Rotten Tomatoes audience_score and the IMDB imdb_rating – both online rating systems, though with somewhat different purposes and levels of irreverence. Doing so should help us get a sense of whether there’s some external validation for using the Rotten Tomatoes score as our main measure of a movie’s success with engaged audiences.

# Plot the correlation of IMDB and Rotten Tomatoes audience scores.
# Include a least-squares regression line to show any linear relation

p <- ggplot(data = movies, aes(x = imdb_rating, y = audience_score)) 
gp <- geom_jitter(color = pal[5], alpha = 0.5) # RColorBrewer "Blues" palette defined earlier
th <- theme_minimal()
l <- labs(x = "IMDB Avg. Rating", y = "Rotten Tomatoes Audience Score", 
       title = "Online Audience Ratings Correlation",
       subtitle = "(with linear model line)") 
lm <- geom_smooth(method = "lm", color = "black", fill = pal[5]) 

p + gp + th + l + lm

# Provide the summary output of a model of that linear relationship

summary(lm(data = movies, audience_score ~ imdb_rating))
## 
## Call:
## lm(formula = audience_score ~ imdb_rating, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.800  -6.567   0.649   5.689  52.896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.3284     2.4183  -17.50   <2e-16 ***
## imdb_rating  16.1234     0.3674   43.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 649 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7476 
## F-statistic:  1926 on 1 and 649 DF,  p-value: < 2.2e-16

There is a linear, positive, and strong correlation between these two measures, with the line for the relationship defined by the equation:

\[\widehat{audience\_score} = -42.33 + 16.12 \times {imdb\_rating}\] Each point on the 10-point imdb_rating scale is associated with an increase of 16.12 points on the 100-point Rotten Tomatoes audience_score scale, and the association is a statistically significant predictor, with a p-value near 0.
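As a quick sanity check, we can evaluate that equation directly; a minimal sketch, where the 7.5 rating is an arbitrary hypothetical input:

```r
# Rounded coefficients from the simple model above
intercept <- -42.33
slope <- 16.12

# Predicted Rotten Tomatoes audience score for a hypothetical film
# with an IMDB average rating of 7.5
round(intercept + slope * 7.5, 1)  # 78.6
```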

Though there is some skew (and greater variability) toward the lower end of these scores, the overall \(R^2\) value of 0.75 (along with the large sample size) suggests we could confidently predict, and within a fairly tight confidence interval, the audience rating for one of these systems given the film’s rating in the other.

It also suggests that the imdb_rating will be one explanatory variable to exclude from the predictive model we’ll develop, given that its collinearity with audience_score would make its inclusion redundant.

What Are Our Best Explanatory Variables?

Next, we’ll consider the explanatory power of variables we might want to include in our predictive model.

Some variables are more obviously predictive than others. For example, the Rotten Tomatoes critics score:

# Create single-variable linear model to calculate R squared value

cl <- lm(data = movies, audience_score ~ critics_score)
cs <- summary(cl)
cr2 <- cs$r.squared

# Scatterplot with least-squares line (including standard error shading)

p <- ggplot(movies, aes(x = critics_score, y = audience_score)) 
l <- labs(x = "Rotten Tomatoes Critics' Score", y = "Rotten Tomatoes Audience Score",
       title = "Single Variable Correlation of Critics and Audience Scores")

p + gp + th + l + lm

As a potential predictor for audience_score, the critics_score will play a useful role in our model: the two variables demonstrate a clearly linear and positive relationship in the scatter plot, with a tight error interval (the shading around the least-squares line) and an \(R^2\) value of 0.496.

Other variables, less so. Take thtr_rel_day, a variable whose value is the day of the month (1 through 31) on which a film was released in theaters. The random spray of its scatter and its \(R^2\) value of 0 suggest that the day of the month predicts very little about audience_score:

# Scatterplot with least-squares line

p <- ggplot(movies, aes(x = thtr_rel_day, y = audience_score))
l <- labs(x = "Day of Month of Theatrical Release", y = "Rotten Tomatoes Audience Score",
       title = "Single Variable Correlation of Release Date and Audience Scores")

p + gp + th + l + lm

Finally, the explanatory power of some variables might be complicated by the clustered nature of the object under study: people value different kinds of movies for different reasons. We can see this in the differences in both the median audience_score and that score’s apparent variability when broken out by genre:

# Create boxplot of audience score broken out by genre

p <- ggplot(data = movies, aes(x = genre, y = audience_score, fill = genre))
gb <- geom_boxplot(alpha = 0.6) 
th <- theme_minimal()
tx <- theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
col <- scale_fill_brewer(palette = "Set3")
l <- labs(title = "Rotten Tomatoes Audience Score by Genre", fill = "Genre")

p + gb + th + tx + col + l

Note the striking disparity of the median audience score between, say, documentary and horror. But note, also, the disparity of the interquartile range for the various genres (which is to say, the range within which the middle half of all audience scores fall). For musical & performing arts films, not only is the median audience score high, but most scores fall fairly close to this point. In contrast, scores for sci-fi and fantasy films are all over the place.

A glance at the standard deviation of the audience_score measure within each genre demonstrates the extent of this issue:

# Notice the wide disparity among standard deviations from genre to genre

gsd <- movies %>% group_by(genre) %>% 
  summarise(count = n(), sd = round(sd(audience_score),1)) %>% 
  arrange(desc(sd))

# Now show it!

gsd %>%
  kable(col.names = c("Movie Genre", "# of Films", "Standard Deviation"),
        caption = "Standard Dev. of Rotten Tomatoes audience score, by Genre") %>%
  kable_styling(bootstrap_options = c("striped", "condensed", "hover"),
                full_width = FALSE,
                position = "left")
Standard Dev. of Rotten Tomatoes audience score, by Genre

Movie Genre                  # of Films   Standard Deviation
Science Fiction & Fantasy             9                 27.2
Other                                16                 21.6
Action & Adventure                   65                 19.8
Animation                             9                 19.6
Mystery & Suspense                   59                 19.3
Comedy                               87                 19.2
Art House & International            14                 18.8
Drama                               305                 18.5
Horror                               23                 16.1
Musical & Performing Arts            12                 11.4
Documentary                          52                  8.4

The question would be whether predictive variables in one genre would work as well in another. For example, while a best_actress_win might be explanatory for an art house film, it might not be for a horror movie (and wouldn’t apply at all for a documentary).


Part 4: Modeling

Some potential explanatory variables we can dispense with immediately in building a model predictive of audience score: variables that are primarily about identifying the record (title, thtr_rel_month, thtr_rel_day, dvd_rel_month, dvd_rel_day, imdb_url, rt_url), variables with so many levels as to leave too few films in each category (studio, director, actor1 through actor5), and variables too collinear to be interesting (imdb_rating and the Rotten Tomatoes categorical audience_rating; and likewise the Rotten Tomatoes categorical critics_rating as largely duplicative of the numerical critics_score).

This leaves us with audience_score as our response variable and with the following explanatory variables: title_type (documentary, feature film, tv movie), genre, runtime, mpaa_rating, thtr_rel_year, dvd_rel_year, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, imdb_num_votes, critics_score, and top200_box.

Full Model

We’ll build our full version of the multiple linear model around these fourteen explanatory variables. From there, in the interest of creating the most practical, parsimonious model to guide the efforts of producers, marketers, streaming-service content buyers, or other stakeholders, we’ll build up a smaller model through forward selection, using those variables that best improve the model’s predictive power.

In the interest of creating a model that is also human readable, we’ll perform transformations on three of our dimensions. Given the broad, exponential range of the number of votes per movie on IMDB that we saw earlier, we’ll use the base 10 log of the imdb_num_votes variable: this will let us read, from the linear model’s coefficient for that variable, the impact on audience_score as the number of votes scales up from 1,000 to 10,000 to 100,000. And to bring each movie’s theatrical and DVD release year into the scale of decades rather than millennia, both thtr_rel_year and dvd_rel_year will be based on their distance from the earliest year on each scale rather than on their absolute Anno Domini value.
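The point of the log transform is that each tenfold jump in vote count adds exactly one unit to the transformed variable, so the model’s coefficient can be read as “points of audience_score per 10× increase in votes.” A quick illustration:

```r
# A tenfold increase in raw vote count is a constant step of 1
# on the log10 scale
log10(c(1000, 10000, 100000))  # 3 4 5
```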

NOTE: There are 8 films that have no values for dvd_rel_year and 1 with no value for runtime in the original movies dataset. We omit those observations for our modeling going forward.

# Create model dataset for 15 remaining dimensions, omitting observations with missing information

m <- na.omit(movies %>% 
               select(audience_score, title_type, genre, runtime, mpaa_rating,
                      imdb_num_votes, critics_score,
                      best_pic_nom, best_pic_win, best_actor_win, 
                      best_actress_win, best_dir_win,
                      top200_box, thtr_rel_year, dvd_rel_year)
             )

# Transform IMDB votes and DVD and theatrical release year to be human readable in the model

dvd_min <- min(m$dvd_rel_year)      # 1991
thtr_min <- min(m$thtr_rel_year)    # 1972

m <- m %>%
  mutate(log10_imdb_votes = log10(imdb_num_votes),
         thtr_year = thtr_rel_year - thtr_min,
         dvd_year = dvd_rel_year - dvd_min)

# Create final dataset with transformed variables replacing originals

m <- m %>% 
  select(audience_score, title_type, genre, runtime, mpaa_rating, 
         log10_imdb_votes, critics_score,
         best_pic_nom, best_pic_win, best_actor_win, best_actress_win, 
         best_dir_win, top200_box, thtr_year, dvd_year)

Now, we can build a linear model based on the full range of variables in our final dataset:

# Build a full linear model using all fourteen possible explanatory variables

lm_full <- lm(data = m, audience_score ~ .)

# Show the summary output of this model

summary(lm_full)
## 
## Call:
## lm(formula = audience_score ~ ., data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.864  -7.616   0.067   8.208  46.238 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     14.930379   7.374144   2.025 0.043332 *  
## title_typeFeature Film          -6.843783   4.947156  -1.383 0.167053    
## title_typeTV Movie             -10.860270   7.642672  -1.421 0.155824    
## genreAnimation                   6.967361   5.166805   1.348 0.178000    
## genreArt House & International  15.154643   4.140733   3.660 0.000274 ***
## genreComedy                      2.776957   2.226423   1.247 0.212774    
## genreDocumentary                16.524661   5.214744   3.169 0.001607 ** 
## genreDrama                       7.247119   1.998702   3.626 0.000312 ***
## genreHorror                     -5.509668   3.289967  -1.675 0.094505 .  
## genreMusical & Performing Arts  17.086074   4.490021   3.805 0.000156 ***
## genreMystery & Suspense         -0.453022   2.483335  -0.182 0.855310    
## genreOther                       5.397964   3.750503   1.439 0.150586    
## genreScience Fiction & Fantasy  -3.120929   4.901145  -0.637 0.524509    
## runtime                          0.006771   0.032252   0.210 0.833785    
## mpaa_ratingNC-17                -8.900363   9.914502  -0.898 0.369692    
## mpaa_ratingPG                   -1.699961   3.711308  -0.458 0.647080    
## mpaa_ratingPG-13                -4.003895   3.883796  -1.031 0.302983    
## mpaa_ratingR                    -1.699541   3.734412  -0.455 0.649197    
## mpaa_ratingUnrated               3.151844   4.378430   0.720 0.471887    
## log10_imdb_votes                 8.911483   1.002600   8.888  < 2e-16 ***
## critics_score                    0.372703   0.022901  16.275  < 2e-16 ***
## best_pic_nomyes                  7.599697   3.393776   2.239 0.025494 *  
## best_pic_winyes                 -4.247395   5.943692  -0.715 0.475125    
## best_actor_winyes               -2.515969   1.554319  -1.619 0.106027    
## best_actress_winyes             -2.571736   1.718174  -1.497 0.134964    
## best_dir_winyes                 -2.055904   2.257971  -0.911 0.362912    
## top200_boxyes                   -1.249142   3.594872  -0.347 0.728351    
## thtr_year                       -0.100360   0.075046  -1.337 0.181618    
## dvd_year                        -0.503174   0.154245  -3.262 0.001167 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.98 on 613 degrees of freedom
## Multiple R-squared:  0.6029, Adjusted R-squared:  0.5848 
## F-statistic: 33.25 on 28 and 613 DF,  p-value: < 2.2e-16

As a whole, the \(R^2\) and \(R^2adj.\) values tell us that nearly 60% of the variation in our response variable audience_score can be explained by the variation in our other explanatory variables.

As we saw before, critics_score is a good predictor of audience_score. A less expected correlation is with the variable with the second-lowest p-value in the full model: the number of votes on IMDB (but perhaps suggesting that voting at all has some relationship to positive feelings about a movie).

There is evidence of collinearity among variables. Most obviously, while a best-picture Oscar nomination has a positive association with audience_score in this model (+7.60), a best-picture win has a negative one (-4.25). On its face, this seems odd! By definition, these are collinear measures, since a film cannot win an Oscar without first being nominated. So there will be some things to tease out in refining our model through stepwise selection.

Refining the Model

Next, we start over from scratch to build up a simpler, more operationally practical model based on forward selection – testing explanatory values one at a time, determining the variable that adds the most value to the model at each stage, then repeating the process until adding any new variables does not improve the model. For our purposes, that “until” will be based on the point where we see no improvement to the \(R^2 adj.\) value; this is the point at which the predictive value added by a variable is outweighed by the sacrifice we’d be making in terms of our model’s degrees of freedom.

The process will look like this:

# Add one variable at a time to the current model (starting from zero variables) and track the best adjusted R-squared value

summary(lm_0 <- lm(data = m, audience_score ~ title_type))
## 
## Call:
## lm(formula = audience_score ~ title_type, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.634 -14.634   2.366  15.366  36.366 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              83.269      2.662  31.278  < 2e-16 ***
## title_typeFeature Film  -22.635      2.778  -8.148 1.95e-15 ***
## title_typeTV Movie      -26.469      8.989  -2.945  0.00335 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.2 on 639 degrees of freedom
## Multiple R-squared:  0.09467,    Adjusted R-squared:  0.09184 
## F-statistic: 33.41 on 2 and 639 DF,  p-value: 1.583e-14

… though in the interest of space and sanity, we’ll do this cooking-show style and simply reveal the top result for each round as we go.
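(As a side note, this round-by-round search can be automated. The helper below is a sketch of my own, not part of the course materials; it’s demonstrated on R’s built-in mtcars dataset so that it runs without the movies data, but the same call applies to our data frame m with response audience_score.)

```r
# One round of forward selection: try adding each remaining candidate
# variable to the current model and report the adjusted R-squared of each fit
forward_step <- function(df, response, current_vars = character(0)) {
  candidates <- setdiff(names(df), c(response, current_vars))
  adj_r2 <- sapply(candidates, function(v) {
    f <- reformulate(c(current_vars, v), response = response)
    summary(lm(f, data = df))$adj.r.squared
  })
  sort(adj_r2, decreasing = TRUE)  # best addition for this round listed first
}

# Example on mtcars: which single variable best predicts mpg?
head(forward_step(mtcars, "mpg"), 3)
```

Repeating the call with current_vars grown by each round’s winner reproduces the manual process below, stopping when the best candidate no longer improves \(R^2 adj.\)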

In our first round, we find that the Rotten Tomatoes critics_score serves as the single best explanatory variable for our model (result: \(R^2adj. = 0.4895\)). In other words, the responses of both critics and audiences on Rotten Tomatoes are fairly well (but not perfectly) aligned.

summary(lm_1 <- lm(data = m, audience_score ~ 
                     critics_score))$adj.r.squared
## [1] 0.4895481

Round two adds genre to the model (result: \(R^2adj. = 0.5142\)). The genres Animation, Art House & International, Documentary, Drama, Musical & Performing Arts, and Other correlate positively with audience_score, while Comedy, Horror, Mystery & Suspense, and Science Fiction & Fantasy correlate negatively.

summary(lm_2 <- lm(data = m, audience_score ~ 
                     critics_score + genre))$adj.r.squared
## [1] 0.5141772

Next, we add the base 10 log of the number of votes on IMDB (log10_imdb_votes) to the model (result: \(R^2adj. = 0.5644\)).

summary(lm_3 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes))$adj.r.squared
## [1] 0.5644097

Then: the year of the movie’s DVD release, relative to the earliest DVD-release year in the dataset (dvd_year) (result: \(R^2adj. = 0.5803\)).

summary(lm_4 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year))$adj.r.squared
## [1] 0.5802862

From this point forward, we are getting into additions to the model that are generating only marginal improvements. Under some – maybe most – circumstances, we might choose to stop here, ending up with a fairly parsimonious model in which just 4 explanatory variables do the heavy lifting and would be easily comprehended by most human stakeholders: a Rotten Tomatoes audience score can be predicted with some degree of confidence if we know how critics scored it, how many votes it has received on IMDB, and its genre and year of DVD release.

However, in our current context of learning to better understand building predictive linear regression models through forward selection, we’ll continue to the bitter end!

The next round reveals that adding a movie’s mpaa_rating (G, PG, PG-13, R, NC-17, Unrated) improves the model slightly (result: \(R^2adj. = 0.5818\)). (All ratings other than G and Unrated correlate negatively with audience score.)

summary(lm_5 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating))$adj.r.squared
## [1] 0.5818098

…As does best_pic_nom – an Oscar nomination for Best Picture (result: \(R^2adj. = 0.5833\)).

summary(lm_6 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating + best_pic_nom))$adj.r.squared
## [1] 0.5833209

The presence of an Oscar-winning actress (best_actress_win, not necessarily for the current movie) also improves the model (result: \(R^2adj. = 0.5846\)).

summary(lm_7 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating + best_pic_nom + 
                     best_actress_win))$adj.r.squared
## [1] 0.5846128

…As does best_actor_win (result: \(R^2adj. = 0.5856\)).

summary(lm_8 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating + best_pic_nom + 
                     best_actress_win + best_actor_win))$adj.r.squared
## [1] 0.5856107

Further marginal improvement is provided by including the year of the movie’s theatrical release, relative to the earliest theatrical-release year in the dataset (thtr_year) (result: \(R^2adj. = 0.5859\)).

summary(lm_9 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating + best_pic_nom + 
                     best_actress_win + best_actor_win + thtr_year))$adj.r.squared
## [1] 0.585879

We add title_type (Documentary, Feature Film, TV Movie) (result: \(R^2adj. = 0.5862\)).

summary(lm_10 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating + best_pic_nom + 
                     best_actress_win + best_actor_win + thtr_year +
                      title_type))$adj.r.squared
## [1] 0.5861872

And the final round for which a variable improves the \(R^2adj.\) measure adds the slight improvement provided by best_dir_win (the presence of an Oscar-winning director).

Here’s the full summary of our final optimized model, with an \(R^2adj.\) value of \(0.5864\).

# Build the model for round eleven

lm_11 <- lm(data = m, audience_score ~ 
                     critics_score + genre + log10_imdb_votes + 
                     dvd_year + mpaa_rating + best_pic_nom + 
                     best_actress_win + best_actor_win + thtr_year +
                      title_type + best_dir_win)

# Save it as the final model and provide the model's summary output.

lm_final <- lm_11
summary(lm_final)
## 
## Call:
## lm(formula = audience_score ~ critics_score + genre + log10_imdb_votes + 
##     dvd_year + mpaa_rating + best_pic_nom + best_actress_win + 
##     best_actor_win + thtr_year + title_type + best_dir_win, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.221  -7.608  -0.096   8.302  46.369 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     15.60987    6.99833   2.231 0.026073 *  
## critics_score                    0.37278    0.02283  16.332  < 2e-16 ***
## genreAnimation                   7.08664    5.13078   1.381 0.167718    
## genreArt House & International  15.16113    4.13032   3.671 0.000263 ***
## genreComedy                      2.72467    2.20761   1.234 0.217593    
## genreDocumentary                16.50157    5.19918   3.174 0.001579 ** 
## genreDrama                       7.30668    1.98175   3.687 0.000247 ***
## genreHorror                     -5.53661    3.26480  -1.696 0.090419 .  
## genreMusical & Performing Arts  17.21168    4.46222   3.857 0.000127 ***
## genreMystery & Suspense         -0.41306    2.47059  -0.167 0.867275    
## genreOther                       5.57374    3.73663   1.492 0.136304    
## genreScience Fiction & Fantasy  -3.15475    4.88831  -0.645 0.518930    
## log10_imdb_votes                 8.84334    0.95505   9.260  < 2e-16 ***
## dvd_year                        -0.50655    0.15354  -3.299 0.001025 ** 
## mpaa_ratingNC-17                -8.80315    9.88545  -0.891 0.373537    
## mpaa_ratingPG                   -1.54070    3.68191  -0.418 0.675763    
## mpaa_ratingPG-13                -3.70767    3.81708  -0.971 0.331761    
## mpaa_ratingR                    -1.45316    3.67916  -0.395 0.693002    
## mpaa_ratingUnrated               3.38388    4.30125   0.787 0.431748    
## best_pic_nomyes                  6.63137    3.04739   2.176 0.029929 *  
## best_actress_winyes             -2.63755    1.69973  -1.552 0.121236    
## best_actor_winyes               -2.38739    1.52341  -1.567 0.117596    
## thtr_year                       -0.09905    0.07328  -1.352 0.177010    
## title_typeFeature Film          -6.79801    4.93416  -1.378 0.168782    
## title_typeTV Movie             -10.91360    7.62364  -1.432 0.152780    
## best_dir_winyes                 -2.42895    2.14086  -1.135 0.256998    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.96 on 616 degrees of freedom
## Multiple R-squared:  0.6025, Adjusted R-squared:  0.5864 
## F-statistic: 37.35 on 25 and 616 DF,  p-value: < 2.2e-16

Our final multiple linear regression model lm_final includes 12 of the original movies dataset’s variables, and excludes three (runtime, best_pic_win, and top200_box) that we had considered in the full model but that did not create sufficient value for our optimized version.

The 12 remaining variables consist of the numeric response variable along with 4 numeric, 4 two-level categorical, and 3 multi-level categorical explanatory variables. I’ll note that its \(R^2adj.\) value of \(0.5864\) improves only very slightly (by about \(0.16\) of a percentage point) over the full model’s \(0.5848\), and does so without significantly simplifying the model. So in real practice, given the danger of overfitting to our particular sample, and given the desire to have a model that’s relatively easy to explain to producers, marketers, media buyers, etc., I’ll reiterate that we might opt, ultimately, for a more parsimonious model that would exclude some of the later additions – say, those after the fourth or fifth round – that only offered marginal improvements.


Part 5: Prediction

In equation form, our model for predicting audience_score can be expressed (with some rounding) as:

\[\begin{aligned} \widehat{audience\_score} = &15.6 + 0.4 \times {critics\_score} + 7.1 \times {genreAnimation}\\ &+ 15.2 \times {genreArt\ House\ \&\ International} + 2.7 \times {genreComedy} \\ &+ 16.5 \times {genreDocumentary} + 7.3 \times {genreDrama} - 5.5 \times {genreHorror}\\ &+ 17.2 \times {genreMusical\ \&\ Performing\ Arts} - 0.4 \times {genreMystery\ \&\ Suspense} \\ &- 3.2 \times {genreScience\ Fiction\ \&\ Fantasy} + 5.6 \times {genreOther} \\ &+ 8.8 \times {log10\_imdb\_votes} - 0.5 \times {dvd\_year}\\ &- 1.5 \times {mpaa\_ratingPG} - 3.7 \times {mpaa\_ratingPG\mbox{-}13} - 1.5 \times {mpaa\_ratingR} \\ &- 8.8 \times {mpaa\_ratingNC\mbox{-}17} + 3.4 \times {mpaa\_ratingUnrated} \\ &+ 6.6 \times {best\_pic\_nomyes} - 2.6 \times {best\_actress\_winyes} - 2.4 \times {best\_actor\_winyes} \\ &- 0.1 \times {thtr\_year} - 6.8 \times {title\_typeFeature\ Film} - 10.9 \times {title\_typeTV\ Movie} \\ &- 2.4 \times {best\_dir\_winyes} \end{aligned}\]

…where all categorical variables (genre, MPAA rating, title type, Oscar nominations/wins) have been broken out into levels that take the value of either \(0\) or \(1\).
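
R’s modeling functions perform this dummy coding automatically. As a quick illustration, here’s a minimal, self-contained sketch (toy data, not the movies dataset) of how model.matrix() expands a factor into indicator columns, with the first level absorbed into the intercept as the reference level:

```r
# Toy illustration (not the movies dataset): R expands a factor into
# 0/1 indicator columns, with the first level ("Action" here) serving
# as the reference level absorbed into the intercept.

toy <- data.frame(genre = factor(c("Action", "Comedy", "Documentary")))
model.matrix(~ genre, data = toy)
```

Each non-reference level gets its own column (genreComedy, genreDocumentary here), which is why the model equation above carries a separate coefficient for each genre level.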

So, for example, we would expect a PG-rated documentary released both theatrically and on DVD in 2014 with 90,000 votes on IMDB and a Rotten Tomatoes critics score of 92 to receive a Rotten Tomatoes audience score of:

\[\begin{aligned} &15.6 + 0.4 \times 92 + 0 + 0 + 0 + 16.5 + 0 + 0 + 0 + 0 + 0 + 0 \\ &+ 8.8 \times log_{10}90000 - 0.5 \times (2014 - 1991) - 1.5 + 0 + 0 + 0\\ &+ 0 + 0 + 0 - 0.1 \times (2014 - 1972) + 0 + 0 + 0 \approx 95.3 \end{aligned}\]

Or, using R’s predict() function, we can test our model on some real-world examples from outside the dataset.

Let’s start with I Am Not Your Negro: a 2017 film directed by Raoul Peck about American author James Baldwin, and a film not that far off from the hypothetical example I just gave: an Oscar-nominated documentary (though not for overall best picture) with a Rotten Tomatoes critics score of 98 and 13,424 votes on IMDB.

I Am Not Your Negro, Magnolia Pictures, 2017


Using our model and requesting a 95% prediction interval, we get the following results:

# Create data frame with movie's information

i_am_not <- data.frame(critics_score = 98, genre = "Documentary",
                       log10_imdb_votes = log10(13424),
                       dvd_year = (2017 - dvd_min), mpaa_rating = "PG-13",
                       best_pic_nom = "no", best_actress_win = "no",
                       best_actor_win = "no", thtr_year = (2017 - thtr_min),
                       title_type = "Documentary", best_dir_win = "no")

# Pass these values to the predictive model, and also construct the prediction interval around the fitted result.

predict(lm_final, i_am_not, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 83.81297 57.70457 109.9214

The model predicts a Rotten Tomatoes audience score of 83.8 for a film sharing the qualities of I Am Not Your Negro, with a 95% prediction interval of 57.7 to 109.9 (though 100 is, of course, the hard upper limit for the score). And in fact, as of March 2, 2019, I Am Not Your Negro holds a Rotten Tomatoes audience score of 83. Pretty close!
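
Since the response is bounded, in practice we might truncate fitted values and interval endpoints to the 0–100 scale. A hypothetical helper (not part of the original analysis) could look like this:

```r
# Hypothetical helper: truncate predicted scores and interval bounds
# to the 0-100 scale Rotten Tomatoes actually uses.

clamp_score <- function(x) pmin(pmax(x, 0), 100)

clamp_score(c(83.81297, 57.70457, 109.9214))  # the 109.92 upper bound becomes 100
```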

Two other examples – both famous/infamous within the Rotten Tomatoes universe – might be interesting to play through our model: 2017’s Lady Bird (a coming-of-age comedy that for many weeks after its release held a 100% no-rotten-tomatoes score), and 2018’s Gotti (the mobster bio-pic drama that as of this writing continues to have a perfect all-rotten-tomatoes score of zero “fresh” ratings).

The current audience score for Lady Bird on Rotten Tomatoes – a comedy with an almost perfect critics score, a large number of IMDB votes, and a Best Picture nomination – is what seems to me a surprisingly low 79. But what would the model predict?

Lady Bird, A24, 2017

# Create data frame with movie's information. 
# NOTE - I'm calling this a comedy, because it's very funny. IMDB classes it as "Comedy, Drama."

lady_bird <- data.frame(critics_score = 99, genre = "Comedy",
                        log10_imdb_votes = log10(169117),
                        dvd_year = (2018 - dvd_min), mpaa_rating = "R",
                        best_pic_nom = "yes", best_actress_win = "no",
                        best_actor_win = "no", thtr_year = (2017 - thtr_min),
                        title_type = "Feature Film", best_dir_win = "no")

# Pass these values to the predictive model, and also construct the prediction interval around the fitted result.

predict(lm_final, lady_bird, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 81.72055 55.21045 108.2307

I may have been surprised by the actual 79 score, but the model wasn’t: it predicts an audience_score of 81.7, well within the 95% prediction interval of 55.2 to 108.2.

And what about Gotti: a drama with an ultra-low critics score, a low number of IMDB votes, and nothing in the way of Oscar nominations (and hence no possible wins)? On Rotten Tomatoes, the audience score currently stands at 48. What does the model predict?

Gotti, Vertical Entertainment, 2018

# Create data frame with movie's information. 
# NOTE - I'm calling this a drama, because it's meant to be. But some might put it in the so-bad-it's-comedy category.

gotti <- data.frame(critics_score = 0, genre = "Drama",
                    log10_imdb_votes = log10(8845),
                    dvd_year = (2018 - dvd_min), mpaa_rating = "R",
                    best_pic_nom = "no", best_actress_win = "no",
                    best_actor_win = "no", thtr_year = (2018 - thtr_min),
                    title_type = "Feature Film", best_dir_win = "no")

# Pass these values to the predictive model, and also construct the prediction interval around the fitted result.

predict(lm_final, gotti, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 31.33446 5.510647 57.15828

In this case, actual audiences are more generous than our model, which predicts that a film like Gotti would receive an audience score between 5.5 and 57.2 (at the 95% level), with a fitted value of 31.3 falling well below the actual current Rotten Tomatoes audience score of 48.


Part 6: Conclusions & Further Directions

The linear model we’ve been able to create for understanding audience_score from other dimensions of the movies dataset is sensitive. Even for outlier films like Lady Bird and Gotti, we can establish a significant correlation to explanatory variables such as the year of the film’s DVD release and the number of votes racked up for the movie on IMDB, and we can use that correlation to make predictions about audience_score with 95% confidence.

And yet, while sensitive, the model is not terribly specific. A 50-plus-point spread on what is, after all, only a 100-point scale is a pretty broad confidence interval. And note that while the intervals for Lady Bird (with a low-end cutoff of 55.2) and Gotti (with a high-end cutoff of 57.2) overlap by only 2 points, they do overlap. Our model can’t entirely differentiate heroes from zeroes.

Residuals

Performing some diagnostic tests on our regression model demonstrates this.

On the one hand, our model fares pretty well on the nearly normal residuals condition, with the histogram below showing an acceptably normal distribution of residual differences between the values the model predicts and the actual values for audience_score recorded in the 642 observations in our cleaned dataset.

# For ease of use, add predicted and residual values for each observation
# to the movies ("m") dataset itself

m$predicted <- predict(lm_final)
m$residual <- residuals(lm_final)

# Show the distribution of residual values generated by the linear model
# (Uses RColorBrewer palette set up at the beginning of this project)


# Here, and in what follows, I've borrowed ideas for visualizing residuals from others.
# For some, see https://drsimonj.svbtle.com/visualising-residuals
# and https://community.rstudio.com/t/ggplot-makes-residual-plots/738/2

gp <- ggplot(m, aes(x = residual))
ph <- geom_histogram(binwidth = 5, alpha = 0.4, color = pal[5], fill = pal[4])
th <- theme_minimal()
lb <- labs(x = NULL, y = NULL, title = "Count of Residual (Actual Minus Predicted) Values in Final Predictive Model")

gp + ph + th + lb
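
As a complement to the histogram, a normal Q-Q plot of the same residuals offers a second look at this condition (a sketch that assumes the m data frame and pal palette from earlier chunks):

```r
# Plot sample quantiles of the residuals against theoretical normal
# quantiles; points tracking the reference line support the nearly
# normal distribution the histogram suggests.

qqnorm(m$residual, main = "Normal Q-Q Plot of Residuals")
qqline(m$residual, col = pal[5])
```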

The independence of the residuals is supported by their apparently random scatter around zero across our collection of observations (in the presumably random order presented by our dataset) – with what seems to be an approximately even number of over- and under-estimated scores.

# Show the scatter distribution of residuals across the range of observations
# (in order presented by dataset)

gp <- ggplot(m, aes(x = 1:length(m$residual), y = residual))
pp <- geom_jitter(alpha = 0.5, color = pal[5])
hl <- geom_hline(yintercept = 0, color = "black")
lb <- labs(x = "Observations (in dataset order)", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations")

gp + pp + hl + th + lb

However, the fan shape we see in the residual values when looked at relative to the range of our model’s predicted values suggests there is some reason to be concerned by our model’s heteroscedasticity: simply put, our residuals are greater – and our model less successful at predicting audience_score – at the lower ranges of predicted audience scores. That is to say, our model finds it harder to predict how much audiences will say they don’t like the movies that we predict they won’t particularly like.

# Show the scatter distribution of residuals across the range of predicted values

gp <- ggplot(m, aes(x = predicted, y = residual))
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Range of Predicted Values")

gp + pp + hl + th + lb
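
The fan shape we see by eye can also be checked formally. Here’s a sketch using the Breusch-Pagan test, assuming the lmtest package (which this project doesn’t otherwise load) is installed:

```r
# Breusch-Pagan test on the fitted model (assumes the lm_final object
# from earlier chunks). The null hypothesis is constant residual
# variance, so a small p-value would corroborate the heteroscedasticity
# visible in the fan shape.

library(lmtest)
bptest(lm_final)
```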

This points us back to our third research question:

Does Genre Matter?

We might well ask, especially given some of the question marks presented by the non-specificity and heteroscedasticity of our model, whether the model works equally well for different kinds of movies. In other words, given the different reasons that audiences seek out and enjoy different kinds of movies, is there any reason to think the same set of predictors, and the weight given to each, would equally successfully target both dramas and comedies, say, or both documentaries and action movies?

The tale told by the residuals alone doesn’t get us very far. Here’s what our earlier chart of residuals across observations looks like, broken out categorically by genre.

# Set up some of the framework we'll use for the next set of charts

pt <- geom_jitter(alpha = 1.0)
hl <- geom_hline(yintercept = 0, color = "Black")
th <- theme_minimal()

# Plot residuals across observation, distinguishing genre by color

gp <- ggplot(m, aes(x = 1:length(m$residual), y = residual, color = genre)) 
col <- scale_color_brewer(palette = "Set3")
lb <- labs(x = "Observations (in dataset order)", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations, by Genre", color = "Genre")

gp + col + pt + hl + th + lb

We know from our boxplot at the beginning of this study (using the same set of colors) that there was variation in not only the median audience_score by genre, but also in the spread of values, with the standard deviation of scores ranging from a tight 8.4 for Documentary to a very broad 27.2 for Science Fiction & Fantasy.
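
Those per-genre spreads can be recomputed directly from the cleaned dataset (a sketch assuming the m data frame from earlier chunks):

```r
# Sample size, mean, and standard deviation of audience_score by genre,
# sorted from widest to tightest spread.

m %>%
  group_by(genre) %>%
  summarise(n = n(),
            mean_score = mean(audience_score),
            sd_score = sd(audience_score)) %>%
  arrange(desc(sd_score))
```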

But it’s hard to see that here. And even zeroing in on the residual scatter of just two genres within the whole – drama and comedy – nothing readily jumps out when looking across the range of observations in dataset order.

# Create level variables for drama and comedy to allow for their visualization

m$DramOrCom <- as.factor(
  ifelse(m$genre == "Drama", "Drama",
         ifelse(m$genre == "Comedy", "Comedy",
                "Other")))

# Show the scatter distribution of residuals across the range of observations
# (in order presented by dataset) for Comedy and Drama

gp <- ggplot(m, aes(x = 1:length(m$residual), y = residual, color = DramOrCom))
col <- scale_color_brewer(palette = "Set2")
lb <- labs(x = "Observations (in dataset order)", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Drama and Comedy highlighted)", color = "Genre")

gp + pt + col + hl + th + lb

But looking at the residuals by genre across the low-to-high range of audience_score predicted by our model reveals something quite different, and instructive, particularly when taking the question of heteroscedasticity into account. We can see, for example, that the scores for dramas lean toward the higher end of the scale, but follow the basic pattern of the overall dataset in terms of increasing in predictive accuracy at the higher end of scores.

# Create level variable

m$DramYes <- as.factor(
  ifelse(m$genre == "Drama", "Drama",
         "Other"))

# Plot

gp <- ggplot(m, aes(x = predicted, y = residual, color = DramYes))
col <- scale_color_brewer(palette = "Set1")
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Drama highlighted)", color = "Genre")

gp + pt + col + hl + th + lb

For comedies, on the other hand, the scores tend toward the lower end, and the predictions stand out from the general pattern in that they do not show much improvement as scores rise, remaining broadly scattered into the 60s.

m$ComYes <- as.factor(
  ifelse(m$genre == "Comedy", "Comedy",
         "Other"))
         
gp <- ggplot(m, aes(x = predicted, y = residual, color = ComYes))
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Comedy highlighted)", color = "Genre")

gp + pt + col + hl + th + lb

Documentary scores cluster near the higher end and are fairly homogeneous (remember, the standard deviation within the genre’s actual scores is only 8.4). The predicted scores occupy a wider range than the actual scores, and so their accuracy falls off in opposite directions on either side of about 80, with predictions below that range tending to underestimate the actual score, and predictions above it tending to overshoot.

m$DocYes <- as.factor(
  ifelse(m$genre == "Documentary", "Documentary",
         "Other"))

gp <- ggplot(m, aes(x = predicted, y = residual, color = DocYes))
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Documentary highlighted)", color = "Genre")

gp + pt + col + hl + th + lb

And even more than comedy, predictability remains elusive across most of the predictive range of our model for Action & Adventure films.

m$ActionYes <- as.factor(
  ifelse(m$genre == "Action & Adventure", "Action & Adventure",
         "Other"))

gp <- ggplot(m, aes(x = predicted, y = residual, color = ActionYes))
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Action & Adventure highlighted)", color = "Genre")

gp + pt + col + hl + th + lb

All this is just to say that the various genres have very different profiles within our model. As Netflix knows when it makes recommendations to me “Because You Watched…,” and as a whole sub-industry built around recommender systems has to assume, aesthetics are a niche game: one person’s sci-fi treasure is another person’s action-and-adventure trash. And the different ways audiences approach genres, with differing expectations, are likely poorly served by our one-size-fits-all approach here.

A Better Model?

Indeed, building optimized models for two of the genres for which our original dataset has at least 30 movies suggests how a more focused approach can yield more robust results. For example, a model built solely for “Mystery & Suspense” yields an \(R^2adj.\) value of \(0.7258\). This significantly outperforms the model we built for the full dataset, for which our final optimization brought us up only to an \(R^2adj.\) value of \(0.5864\).

# Build sub-datasets for two genres: mystery and comedy

m_mystery <- m %>% filter(genre == "Mystery & Suspense") %>% select(-genre)
m_comedy <- m %>% filter(genre == "Comedy") %>% select(-genre)

# Train and optimize a model for the "mystery" sub-dataset (trials not shown)

lm_mystery <- lm(data = m_mystery, audience_score ~
                   critics_score + log10_imdb_votes + dvd_year +
                   best_pic_nom + thtr_year + mpaa_rating + 
                   best_actress_win)

summary(lm_mystery)
## 
## Call:
## lm(formula = audience_score ~ critics_score + log10_imdb_votes + 
##     dvd_year + best_pic_nom + thtr_year + mpaa_rating + best_actress_win, 
##     data = m_mystery)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.732  -6.750   0.000   5.684  27.381 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          19.38629   14.86594   1.304 0.198301    
## critics_score         0.47549    0.05823   8.165 1.07e-10 ***
## log10_imdb_votes      8.43983    2.36743   3.565 0.000824 ***
## dvd_year             -1.79959    0.41369  -4.350 6.88e-05 ***
## best_pic_nomyes      29.84699    9.59981   3.109 0.003123 ** 
## thtr_year             0.72641    0.24289   2.991 0.004346 ** 
## mpaa_ratingPG       -18.32241   11.75073  -1.559 0.125372    
## mpaa_ratingPG-13    -21.59416   12.03057  -1.795 0.078833 .  
## mpaa_ratingR        -23.52615   11.23209  -2.095 0.041405 *  
## best_actress_winyes  -5.43986    3.91498  -1.390 0.170962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.08 on 49 degrees of freedom
## Multiple R-squared:  0.7684, Adjusted R-squared:  0.7258 
## F-statistic: 18.06 on 9 and 49 DF,  p-value: 9.636e-13

The particular collection of significant variables didn’t change much – critics’ avg. score, number of votes on IMDB, year of release (both DVD & theatrical), “Best Picture” nomination – but the weight given to each by its coefficient did.
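
One way to see that shift is to put the two models’ estimates for their shared predictors side by side (a sketch assuming the lm_final and lm_mystery objects from earlier chunks):

```r
# Coefficients common to the general and Mystery & Suspense models,
# shown side by side to compare how their weights differ.

shared <- intersect(names(coef(lm_final)), names(coef(lm_mystery)))
data.frame(general = round(coef(lm_final)[shared], 2),
           mystery = round(coef(lm_mystery)[shared], 2))
```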

And the impact on the residuals is essentially to eliminate the heteroscedasticity we’ve seen in the low-end fan shape of our general model when comparing predicted to actual values. Note that a regression line added to indicate the predicted/actual relationship for our Mystery & Suspense model is all but identical with our reference zero line.

# Add predicted values and residuals to model

m_mystery$predicted <- predict(lm_mystery)
m_mystery$residual <- residuals(lm_mystery)

# Plot
# Use mystery set

gp <- ggplot(m_mystery, aes(x = predicted, y = residual))
pt <- geom_jitter(alpha = 0.5, color = pal[5])
gs <- geom_smooth(method = "lm", alpha = 0.5)
hl <- geom_hline(yintercept = 0, color = "Black")
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Mystery & Suspense)")


gp + pt + col + hl + th + lb + gs

A model designed around Comedy doesn’t improve on the general model’s \(R^2adj.\) value of \(0.5864\) quite as much as our Mystery & Suspense model does, though it comes in at a respectably improved \(R^2adj.\) value of \(0.6536\).

lm_comedy <- lm(data = m_comedy, audience_score ~ 
                  critics_score + log10_imdb_votes + dvd_year + 
                  mpaa_rating + runtime + title_type + thtr_year)

summary(lm_comedy)
## 
## Call:
## lm(formula = audience_score ~ critics_score + log10_imdb_votes + 
##     dvd_year + mpaa_rating + runtime + title_type + thtr_year, 
##     data = m_comedy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.2780  -6.8908   0.7441   6.6828  29.6036 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             57.65689   18.15573   3.176 0.002151 ** 
## critics_score            0.32077    0.05694   5.634 2.76e-07 ***
## log10_imdb_votes        14.19859    2.75564   5.153 1.93e-06 ***
## dvd_year                -0.74313    0.38394  -1.936 0.056595 .  
## mpaa_ratingPG            1.09953   12.36362   0.089 0.929366    
## mpaa_ratingPG-13         5.70422   12.77680   0.446 0.656524    
## mpaa_ratingR            11.56850   12.75817   0.907 0.367367    
## runtime                 -0.40421    0.11642  -3.472 0.000851 ***
## title_typeFeature Film -26.29421    8.72861  -3.012 0.003506 ** 
## thtr_year               -0.40397    0.22710  -1.779 0.079212 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.33 on 77 degrees of freedom
## Multiple R-squared:  0.6898, Adjusted R-squared:  0.6536 
## F-statistic: 19.03 on 9 and 77 DF,  p-value: 2.818e-16

As with our Mystery & Suspense model, the distribution of our Comedy model’s residuals across the range of prediction scores flattens out the heteroscedasticity we saw when comedies were buried within the general model. Here, the variation of predictive error holds largely steady across the range, suggesting this genre-specific model is, in fact, more useful than the general one.

# Add predicted values and residuals to model

m_comedy$predicted <- predict(lm_comedy)
m_comedy$residual <- residuals(lm_comedy)

# Plot
# Use comedy set

gp <- ggplot(m_comedy, aes(x = predicted, y = residual))
pt <- geom_jitter(alpha = 0.5, color = pal[5])
gs <- geom_smooth(method = "lm", alpha = 0.5)
hl <- geom_hline(yintercept = 0, color = "Black")
lb <- labs(x = "Predicted Audience Score", y = "Residual (Actual minus Predicted)", title = "Distribution of Residuals across Observations", subtitle = "(Comedy)")


gp + pt + col + hl + th + lb + gs
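
One rough way to quantify the gain is to compare the residual spread for comedies inside the general model with the spread under the comedy-only model (a sketch assuming the m and m_comedy objects from earlier chunks):

```r
# Residual standard deviation for comedies under each model; a smaller
# value for the dedicated model indicates tighter predictions for the genre.

sd(m$residual[m$genre == "Comedy"])  # comedies within the general model
sd(m_comedy$residual)                # the comedy-specific model
```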

Even more interesting, perhaps, is that this Comedy-focused model mobilizes a somewhat different set of significant variables than the general one: anything Oscar-nomination related is gone, the release years (DVD and theatrical) are only marginally significant, and runtime emerges as a surprising, highly significant variable with a negative coefficient. The soul of wit is, indeed, brevity, it turns out, with comedies in danger of overstaying their welcome!

And since the same danger applies to academic exercises like this one, I’ll take the hint and end our explorations here. We’ve seen how an effective predictive model can be built through multiple regression linking the significance, weight, and direction of a number of explanatory variables to the value of a response variable we’d like to better understand. And we’ve also seen some of the limitations of that model in terms of limited specificity and high heteroscedasticity when its predictions attempt to cover observations that might be better grouped and focused to take into account structural differences within a dataset, pointing toward directions in which we might take our research further.

But now it’s time to butter up some popcorn, wrassle up a lap cat, and sit down for a Sunday night at the movies. So, let’s see, what was the running time on Paterson? And is it a drama, or comedy, or what, really?…